HAR-Net: Joint Learning of Hybrid Attention for Single-stage Object Detection
Object detection has been a challenging task in computer vision. Although
significant progress has been made in object detection with deep neural
networks, the attention mechanism remains underexplored. In this paper, we
propose a hybrid attention mechanism for single-stage object detection.
First, we present the modules of spatial attention, channel attention and
aligned attention for single-stage object detection. In particular, stacked
dilated convolution layers with symmetrically fixed rates are constructed to
learn spatial attention. Channel attention is built with cross-level group
normalization and a squeeze-and-excitation module. Aligned attention is
constructed with organized deformable filters. Second, the three kinds of
attention are unified to construct the hybrid attention mechanism. We then
embed the hybrid attention into RetinaNet and propose the efficient
single-stage HAR-Net for object detection. The attention modules and the
proposed HAR-Net are evaluated on the COCO detection dataset. Experiments
demonstrate that hybrid attention can significantly improve the detection
accuracy, and HAR-Net achieves a state-of-the-art 45.8% mAP, outperforming
existing single-stage object detectors.
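To make the attention modules concrete, here is a minimal sketch of the first two, assuming PyTorch: spatial attention as stacked dilated 3x3 convolutions with symmetric rates, and channel attention in squeeze-and-excitation style. The rates, widths, and reduction factor are our illustrative choices, not the authors' implementation; aligned attention is omitted because deformable convolution needs custom operators.

```python
# A minimal sketch, assuming PyTorch. Rates, widths, and the reduction
# factor are illustrative choices, not the authors' implementation.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Stacked dilated 3x3 convolutions with symmetric rates (e.g. 1-2-1)
    producing a single per-pixel attention map."""
    def __init__(self, channels, rates=(1, 2, 1)):
        super().__init__()
        layers = []
        for r in rates:
            layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, 1, 1))   # collapse to one map
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x * torch.sigmoid(self.body(x))     # gate features spatially

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel reweighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))             # global average pool
        return x * w[:, :, None, None]              # reweight channels

feat = torch.randn(2, 256, 32, 32)                  # e.g. one FPN level
out = ChannelAttention(256)(SpatialAttention(256)(feat))
print(out.shape)                                    # torch.Size([2, 256, 32, 32])
```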
On Reformulated Zagreb Indices with Respect to Tricyclic Graphs
Miličević et al. introduced the reformulated Zagreb indices, which generalize
the classical Zagreb indices of chemical graph theory. In this paper, we
characterize the extremal properties of the first reformulated Zagreb index.
We first introduce some graph operations that increase or decrease this index.
Using these operations, we then determine the extremal tricyclic graphs with
the minimum and maximum first reformulated Zagreb index. Comment: 8 pages, 2 figures
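For readers outside chemical graph theory, the index in question can be computed directly: the first reformulated Zagreb index is EM1(G) = sum over edges e = uv of (d(u) + d(v) - 2)^2, i.e. the first Zagreb index evaluated on edge degrees d(e) = d(u) + d(v) - 2. A short sketch using networkx; the example tricyclic graph is arbitrary.

```python
# A short sketch using networkx; the tricyclic example graph is arbitrary.
import networkx as nx

def first_reformulated_zagreb(G):
    # EM1(G): sum of squared edge degrees, d(e) = d(u) + d(v) - 2
    return sum((G.degree(u) + G.degree(v) - 2) ** 2 for u, v in G.edges())

# A connected graph with |E| = |V| + 2 is tricyclic: 7 vertices, 9 edges.
G = nx.cycle_graph(7)
G.add_edges_from([(0, 3), (1, 5)])
print(first_reformulated_zagreb(G))
```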
DeepDeblur: Fast one-step blurry face images restoration
We propose a very fast and effective one-step restoring method for blurry
face images. Over the past decades, many blind deblurring algorithms have been
proposed to restore latent sharp images. However, these algorithms run slowly
because they involve two steps: kernel estimation followed by non-blind
deconvolution or latent image estimation. They also cannot handle small face
images. Our proposed method restores sharp face images directly in one step
using a convolutional neural network. Unlike previous deep-learning-based
methods that can only handle a single blur kernel at a time, our network is
trained on a large number of randomly blurred sample pairs to cope with the
variation among blur kernels encountered in practice. Smoothness and facial
regularization terms are added to preserve facial identity information, which
is key to face image applications. Comprehensive experiments demonstrate that
our proposed method can handle various blur kernels and achieves
state-of-the-art results in restoring small blurry face images. Moreover, the
proposed method significantly improves face recognition accuracy while running
more than 100 times faster.
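A hedged sketch of what such a training objective could look like, assuming PyTorch: a pixel reconstruction term plus smoothness and facial-identity regularizers. The anisotropic TV penalty and the toy feature extractor phi below are our stand-ins; the paper's exact regularizers may differ.

```python
# A hedged sketch, assuming PyTorch. The anisotropic TV penalty and the toy
# feature extractor `phi` are our stand-ins for the unspecified regularizers.
import torch
import torch.nn.functional as F

def total_variation(img):
    """Penalize large local gradients, encouraging smooth restorations."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def deblur_loss(restored, sharp, identity_net, w_tv=1e-4, w_id=1e-2):
    rec = F.mse_loss(restored, sharp)              # data/reconstruction term
    smooth = total_variation(restored)             # smoothness regularizer
    ident = F.mse_loss(identity_net(restored),
                       identity_net(sharp))        # facial identity regularizer
    return rec + w_tv * smooth + w_id * ident

restored = torch.rand(1, 3, 64, 64, requires_grad=True)
sharp = torch.rand(1, 3, 64, 64)
phi = torch.nn.Conv2d(3, 8, 3, padding=1)          # toy identity features
print(deblur_loss(restored, sharp, phi).item())
```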
CS-R-FCN: Cross-supervised Learning for Large-Scale Object Detection
Generic object detection is one of the most fundamental problems in computer
vision, yet it is difficult to provide bounding-box-level annotations for
large-scale object detection spanning thousands of categories. In this
paper, we present a novel cross-supervised learning pipeline for large-scale
object detection, denoted as CS-R-FCN. First, we propose to utilize the data
flow of image-level annotated images in the fully-supervised two-stage object
detection framework, leading to cross-supervised learning combining
bounding-box-level annotated data and image-level annotated data. Second, we
introduce a semantic aggregation strategy that utilizes the relationships among
the cross-supervised categories to reduce unreasonable mutual inhibition
effects during feature learning. Experimental results show that the
proposed CS-R-FCN improves the mAP by a large margin compared to previous
related work.
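One common way to realize such a cross-supervised data flow is sketched below under our own assumptions (not necessarily the paper's formulation): images with boxes contribute the usual detection loss, while image-level images contribute only a multi-label classification loss on aggregated proposal scores.

```python
# A hedged sketch of one possible per-image loss switch; the aggregation by
# max over proposals is a common weak-supervision surrogate, not necessarily
# the paper's exact formulation.
import torch
import torch.nn.functional as F

def cross_supervised_loss(region_scores, detection_loss, image_labels, has_boxes):
    """region_scores: (R, C) logits for R proposals over C classes;
    image_labels: (C,) multi-hot image-level labels."""
    if has_boxes:
        return detection_loss                     # fully supervised path
    img_logits = region_scores.max(dim=0).values  # aggregate proposal scores
    return F.binary_cross_entropy_with_logits(img_logits, image_labels)

scores = torch.randn(100, 500)                    # 100 proposals, 500 classes
labels = torch.zeros(500); labels[3] = 1.0        # image-level: class 3 present
print(cross_supervised_loss(scores, None, labels, has_boxes=False))
```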
Intention Oriented Image Captions with Guiding Objects
Although existing image caption models can produce promising results using
recurrent neural networks (RNNs), it is difficult to guarantee that an object
we care about is contained in the generated descriptions, for example when the
object is inconspicuous in the image. The problem becomes even harder when
such objects did not appear during training. In this paper, we propose a
novel approach for generating image captions with guiding objects (CGO). CGO
constrains the model to include a human-concerned object when the object is
in the image. CGO ensures that the object is in the generated description while
maintaining fluency. Instead of generating the sequence from left to right, we
start the description with a selected object and generate other parts of the
sequence based on this object. To achieve this, we design a novel framework
combining two LSTMs in opposite directions. We demonstrate the characteristics
of our method on MSCOCO where we generate descriptions for each detected object
in the images. With CGO, we can extend descriptions to objects neglected in
image caption labels and provide a set of more
comprehensive and diverse descriptions for an image. CGO shows advantages when
applied to the task of describing novel objects. We show experimental results
on both MSCOCO and ImageNet datasets. Evaluations show that our method
outperforms the state-of-the-art models in the task with average F1 75.8,
leading to better descriptions in terms of both content accuracy and fluency.
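A toy sketch of the opposite-direction decoding idea, assuming PyTorch and greedy decoding: one LSTM generates the words before the guiding object (right to left), the other generates the words after it, and the two halves are concatenated around the object. Vocabulary size, hidden sizes, and the untrained weights are placeholders; the real model also conditions on image features.

```python
# A toy sketch with untrained weights and greedy decoding; the real model
# also conditions both LSTMs on image features. Sizes are placeholders.
import torch
import torch.nn as nn

vocab, hidden, emb = 1000, 256, 128
embed = nn.Embedding(vocab, emb)
lstm_left = nn.LSTMCell(emb, hidden)     # generates words BEFORE the object
lstm_right = nn.LSTMCell(emb, hidden)    # generates words AFTER the object
head = nn.Linear(hidden, vocab)

def decode(lstm, start_token, steps=5):
    h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
    tok, out = torch.tensor([start_token]), []
    for _ in range(steps):
        h, c = lstm(embed(tok), (h, c))
        tok = head(h).argmax(dim=-1)     # greedy choice of the next word
        out.append(tok.item())
    return out

obj = 42                                 # id of the guiding object word
left = decode(lstm_left, obj)[::-1]      # built right-to-left, so reverse
right = decode(lstm_right, obj)
caption_ids = left + [obj] + right       # the object is guaranteed to appear
print(caption_ids)
```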
R(Det)^2: Randomized Decision Routing for Object Detection
In the paradigm of object detection, the decision head is an important part,
which affects detection performance significantly. Yet how to design a
high-performance decision head remains an open issue. In this paper, we
propose a novel approach to combine decision trees and deep neural networks in
an end-to-end learning manner for object detection. First, we disentangle the
decision choices and prediction values by plugging soft decision trees into
neural networks. To facilitate effective learning, we propose randomized
decision routing with node-selective and associative losses, which boost
feature representation learning and network decision-making simultaneously.
Second, we develop the decision head for object detection with narrow branches
to generate the routing probabilities and masks, for the purpose of obtaining
divergent decisions from different nodes. We name this approach randomized
decision routing for object detection, abbreviated as R(Det)^2. Experiments on
the MS-COCO dataset demonstrate that R(Det)^2 effectively improves detection
performance. Equipped with existing detectors, it achieves AP
improvements. Comment: 10 pages, 5 figures; Accepted by CVPR2022
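A minimal sketch of the soft decision head structure the abstract describes, assuming PyTorch: a narrow routing branch produces a probability that mixes two prediction leaves. The randomized routing losses themselves are not reproduced here; sizes and layout are illustrative.

```python
# A minimal sketch, assuming PyTorch; sizes are illustrative and the
# randomized routing losses are not reproduced here.
import torch
import torch.nn as nn

class SoftDecisionHead(nn.Module):
    def __init__(self, in_dim, out_dim, route_dim=64):
        super().__init__()
        self.router = nn.Sequential(              # narrow routing branch
            nn.Linear(in_dim, route_dim), nn.ReLU(), nn.Linear(route_dim, 1))
        self.leaf_a = nn.Linear(in_dim, out_dim)  # two divergent leaves
        self.leaf_b = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        p = torch.sigmoid(self.router(x))         # soft routing probability
        return p * self.leaf_a(x) + (1 - p) * self.leaf_b(x)

head = SoftDecisionHead(256, 4)                   # e.g. 4 box-regression outputs
print(head(torch.randn(8, 256)).shape)            # torch.Size([8, 4])
```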
Progressive Representation Adaptation for Weakly Supervised Object Localization
We address the problem of weakly supervised object localization where only
image-level annotations are available for training object detectors. Numerous
methods have been proposed to tackle this problem through mining object
proposals. However, a substantial amount of noise in object proposals causes
ambiguities for learning discriminative object models. Such approaches are
sensitive to model initialization and often converge to undesirable local
minima. In this paper, we propose to overcome these drawbacks by
progressive representation adaptation with two main steps: 1) classification
adaptation and 2) detection adaptation. In classification adaptation, we
transfer a pre-trained network to a multi-label classification task for
recognizing the presence of a certain object in an image. Through the
classification adaptation step, the network learns discriminative
representations that are specific to object categories of interest. In
detection adaptation, we mine class-specific object proposals by exploiting two
scoring strategies based on the adapted classification network. Class-specific
proposal mining helps remove substantial noise from the background clutter and
potential confusion from similar objects. We further refine these proposals
using multiple instance learning and segmentation cues. Using these refined
object bounding boxes, we fine-tune all the layers of the classification network
and obtain a fully adapted detection network. We present detailed experimental
validation on the PASCAL VOC and ILSVRC datasets. Experimental results
demonstrate that our progressive representation adaptation algorithm performs
favorably against the state-of-the-art methods. Comment: Project page: https://sites.google.com/site/lidonggg930/ws
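The classification-adaptation step amounts to fine-tuning a pre-trained network as a multi-label classifier with an independent presence score per category. A brief sketch assuming PyTorch/torchvision; the backbone choice and tensor shapes are placeholders.

```python
# A brief sketch assuming PyTorch/torchvision; the backbone and shapes are
# placeholders, not the paper's architecture.
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 20                                     # e.g. PASCAL VOC
net = models.resnet18(weights=None)
net.fc = nn.Linear(net.fc.in_features, num_classes)  # new multi-label head

criterion = nn.BCEWithLogitsLoss()                   # per-class sigmoid, not softmax
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4, num_classes)).float()
loss = criterion(net(images), labels)
loss.backward()                                      # fine-tune the whole network
```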
Intra-clip Aggregation for Video Person Re-identification
Video-based person re-identification has drawn massive attention in recent
years due to its extensive applications in video surveillance. While deep
learning-based methods have led to significant progress, these methods are
limited by their ineffective use of complementary information, which can be
traced to the data augmentation required during training. Data augmentation
has been widely used to mitigate over-fitting and improve the representational
ability of the network. However, previous methods adopt an image-based data
augmentation scheme that processes input frames individually, which corrupts
the complementary information between consecutive frames and causes performance
degradation. Extensive experiments on three benchmark datasets demonstrate that
our framework outperforms the most recent state-of-the-art methods. We also
perform cross-dataset validation to demonstrate the generality of our method. Comment: Due to privacy issues in person re-ID, we are required to withdraw the previous version of this paper.
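To illustrate the issue the abstract raises, one plausible remedy (our assumption, not necessarily the paper's method) is to sample augmentation parameters once per clip rather than once per frame, so consecutive frames stay aligned.

```python
# A plausible remedy sketched under our own assumptions: sample the
# augmentation parameters once per clip so every frame gets the same transform.
import random
import torch
import torchvision.transforms.functional as TF

def augment_clip(frames, max_angle=10.0):
    """frames: list of (C, H, W) tensors from one clip."""
    angle = random.uniform(-max_angle, max_angle)  # sampled ONCE per clip
    flip = random.random() < 0.5
    out = []
    for f in frames:
        f = TF.rotate(f, angle)                    # same parameters every frame
        if flip:
            f = TF.hflip(f)
        out.append(f)
    return out

clip = [torch.rand(3, 128, 64) for _ in range(8)]
augmented = augment_clip(clip)                     # frames remain aligned
```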
Perceive Where to Focus: Learning Visibility-aware Part-level Features for Partial Person Re-identification
This paper considers a realistic problem in person re-identification (re-ID)
task, i.e., partial re-ID. Under partial re-ID scenario, the images may contain
a partial observation of a pedestrian. If we directly compare a partial
pedestrian image with a holistic one, the extreme spatial misalignment
significantly compromises the discriminative ability of the learned
representation. We propose a Visibility-aware Part Model (VPM), which learns to
perceive the visibility of regions through self-supervision. The visibility
awareness allows VPM to extract region-level features and compare two images
with a focus on their shared regions (which are visible in both images). VPM
gains a two-fold benefit toward higher accuracy for partial re-ID. On the one
hand, compared with learning a global feature, VPM learns region-level features
and benefits from fine-grained information. On the other hand, with visibility
awareness, VPM is capable of estimating the shared regions between two images and
thus suppresses the spatial misalignment. Experimental results confirm that our
method significantly improves the learned representation and the achieved
accuracy is on par with the state of the art. Comment: 8 pages, 5 figures, accepted by CVPR2019
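The shared-region comparison can be sketched as a visibility-weighted distance: per-region distances are weighted by the product of both images' visibility scores, so regions missing from either image contribute little. Shapes and the L2 region metric below are our illustrative choices.

```python
# An illustrative sketch; shapes and the L2 region metric are our choices.
import torch

def visibility_weighted_distance(feat1, vis1, feat2, vis2, eps=1e-6):
    """feat*: (R, D) region features; vis*: (R,) visibility scores in [0, 1]."""
    region_d = (feat1 - feat2).pow(2).sum(dim=1).sqrt()  # per-region L2 distance
    w = vis1 * vis2                                      # weight of shared regions
    return (w * region_d).sum() / (w.sum() + eps)        # visible-in-both dominate

f1, f2 = torch.randn(6, 256), torch.randn(6, 256)
v1 = torch.tensor([1., 1., 1., 0., 0., 0.])              # lower body missing
v2 = torch.ones(6)                                       # holistic image
print(visibility_weighted_distance(f1, v1, f2, v2))
```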
Learning Structured Semantic Embeddings for Visual Recognition
Numerous embedding models have been recently explored to incorporate semantic
knowledge into visual recognition. Existing methods typically focus on
minimizing the distance between the corresponding images and texts in the
embedding space but do not explicitly optimize the underlying structure. Our
key observation is that modeling the pairwise image-image relationship improves
the discrimination ability of the embedding model. In this paper, we propose
the structured discriminative and difference constraints to learn
visual-semantic embeddings. First, we exploit the discriminative constraints to
capture the intra- and inter-class relationships of image embeddings. The
discriminative constraints encourage separability for image instances of
different classes. Second, we align the difference vector between a pair of
image embeddings with that of the corresponding word embeddings. The difference
constraints help regularize image embeddings to preserve the semantic
relationships among word embeddings. Extensive evaluations demonstrate the
effectiveness of the proposed structured embeddings for single-label
classification, multi-label classification, and zero-shot recognition. Comment: 9 pages, 6 figures, 5 tables, conference
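A compact sketch of the two constraints, assuming PyTorch: the difference constraint aligns the vector between two image embeddings with the vector between their class word embeddings, while a contrastive-style discriminative term separates classes. The margin and exact loss forms are placeholders.

```python
# A compact sketch; the margin value and exact loss forms are placeholders.
import torch
import torch.nn.functional as F

def difference_loss(img_i, img_j, word_i, word_j):
    # Align the image-image difference with the word-word difference.
    return F.mse_loss(img_i - img_j, word_i - word_j)

def discriminative_loss(img_i, img_j, same_class, margin=1.0):
    d = (img_i - img_j).norm(dim=-1)
    # Pull same-class pairs together, push different-class pairs apart.
    return torch.where(same_class, d, F.relu(margin - d)).mean()

x1, x2 = torch.randn(4, 300), torch.randn(4, 300)    # image embeddings
w1, w2 = torch.randn(4, 300), torch.randn(4, 300)    # class word embeddings
same = torch.tensor([True, False, True, False])
loss = difference_loss(x1, x2, w1, w2) + discriminative_loss(x1, x2, same)
print(loss.item())
```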